Storytelling with Data
Posted on Thu 17 January 2019 in posts
The main components of storytelling with data are:
- Understanding and setting the context
- Exploring multiple angles
- Using compelling visualizations
- Using varied data sources
- Having a consistent narrative
Collecting data and finding supplementary data
Let's take a look at SAT score data in New York City. NYC OpenData provides several different datasets on their website for the public to use. The core of this project will be based around two datasets: SAT Results and the NYC High School Directory.
Using only these two datasets may be enough to produce a basic analysis. However, in order to provide more depth to the topic and to enhance the analysis, we should look to supplement the core data with other relevant datasets. Incorporating a multitude of data to support a compelling analysis will help drive home the eventual results.
In this case, there are several related data sets we can use from the website that cover demographic information and test scores.
We will use the following datasets in this analysis:
- SAT results by school — SAT results for each high school in New York City.
- NYC high school directory — Directory of each high school in New York City.
- School attendance — Attendance information for every school in NYC.
- Math test results — Math test results for every school in NYC.
- Class size — Class size information for each school in NYC.
- AP test results — Advanced Placement exam results for each high school. Passing AP exams can earn students college credit in the US.
- Graduation outcomes — Percentage of students who graduated, and other outcome information.
- Demographics — Demographic information for each school.
- School survey — Surveys of parents, teachers, and students at each school.
- School district maps — Information on the layout of the school districts, so that we can map them out.
These datasets are interrelated, so we can combine them into a single dataset before doing any analysis.
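Since the datasets all share a school identifier, combining them boils down to pandas merges on that key. A minimal sketch with toy frames (the DBN values and column names here are hypothetical, not taken from the real files):

```python
import pandas as pd

# Two toy datasets keyed on a shared DBN identifier (hypothetical values)
sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1200, 1100]})
demo = pd.DataFrame({"DBN": ["01M292", "01M448"], "total_enrollment": [500, 650]})

# A left merge keeps every row of the left frame and pulls in matching columns
combined = sat.merge(demo, on="DBN", how="left")
print(combined.shape)  # (2, 3)
```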
Getting background information
Before diving into analyzing the data, it's useful to research some background information. In this case, we know a few facts that will be useful:
- New York City is divided into 5 boroughs, which are essentially distinct regions.
- Schools in New York City are divided into several school districts, each of which can contain dozens of schools.
- Not all the schools in all of the datasets are high schools, so we'll need to do some data cleaning.
- Each school in New York City has a unique code called a DBN, or District Borough Number.
- By aggregating data by district, we can use the district mapping data to plot district-by-district differences.
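To illustrate the DBN structure, a quick sketch with a hypothetical code: the first two characters give the district, the next the borough letter, and the rest the school number (the slicing below is an illustration, not taken from the datasets):

```python
# Hypothetical DBN: two-digit district, one-letter borough, school number
dbn = "01M015"
district, borough, number = dbn[:2], dbn[2], dbn[3:]
print(district, borough, number)  # 01 M 015
```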
Exploratory data analysis
The first step in understanding the data is to do some exploratory data analysis, so let's examine the datasets gathered above.
NYC OpenData makes most of its datasets available via API, in both JSON and CSV format. For example, the SAT results data is available in CSV format from this link. For this analysis, the CSV files have been downloaded locally.
First, let's create a dictionary mapping each dataset name to its file. Then we can loop through that dictionary to read each file into a pandas DataFrame.
import pandas as pd
import numpy as np
files = {'sat_results': 'sat_results.csv',
         'hs_directory': 'hs_directory.csv',
         'math_results': 'math_results.csv',
         'class_size': 'class_size.csv',
         'ap_results': 'ap_results.csv',
         'grad_outcomes': 'grad_outcomes.csv',
         'demographics': 'demographics.csv'}
# Put data from each link into a DataFrame then store the data in
# a new dictionary called 'data'
data = {}
for key, value in files.items():
    d = pd.read_csv("/Users/swagbird/Desktop/Data Science/Projects/"
                    "Dataquest Portfolio Series/Storytelling with Data"
                    "/data/{0}".format(value))
    data[key] = d
for k, v in data.items():
    print(v.shape)
# Print columns and the first 5 lines of each DataFrame in our dictionary
for k, v in data.items():
    print("\n" + k + "\n")
    print(v.columns)
    print(v.head())
Unifying the data
Now that we have all our datasets imported into DataFrames, let's try to unify them into one. Most of the datasets appear to have a DBN column, which serves as a unique identifier for each school. Let's do a quick check:
# Print the name of each dataset that contains a DBN column
for k, v in data.items():
    if 'DBN' in v.columns:
        print(k)
DBN is missing from two datasets: hs_directory and class_size. In hs_directory the DBN column is named dbn, so we just need to rename that column. Let's find out how we can link class_size back to the other six datasets.
The DBN column looks like this:
data['demographics'].DBN.head()
What does the class_size dataset look like?
data['class_size'].head()
Comparing the DBN column to the class_size data, it looks like DBN is a combination of the first three columns: CSD, BOROUGH, and SCHOOL CODE. DBN stands for "District Borough Number", so it seems that CSD is the "District", BOROUGH is the "Borough", and SCHOOL CODE is the "Number". It looks like BOROUGH is already contained within SCHOOL CODE, and therefore, DBN can be formulated by combining CSD with SCHOOL CODE.
Let's add the DBN column to class_size and rename it in hs_directory:
data['class_size']["DBN"] = data['class_size'].apply(lambda x: '{0:02d}{1}'.format(x['CSD'], x['SCHOOL CODE']), axis=1)
data['class_size'].head(1)
data['hs_directory'].rename(columns={'dbn':'DBN'}, inplace=True)
data['hs_directory'].head(1)
Including survey data
The survey data is split up between two files: data for all schools and data for school district 75. We will create a flag for district 75 schools, and then we will concatenate these two datasets into one combined DataFrame. Afterwards, we can add this new dataset to our data dictionary.
survey_all = pd.read_csv('/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/survey_all.csv', skiprows=2, encoding='windows-1252')
survey_d75 = pd.read_csv('/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/survey_d75.csv', skiprows=2, encoding='windows-1252')
survey_all['d75'] = False
survey_d75['d75'] = True
survey = pd.concat([survey_all, survey_d75], axis=0, sort=False)
survey.head()
Let's remove unnecessary columns from our new survey dataset.
# Rename DBN column and filter only for important columns in dataset
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s",
"N_t", "N_p", "saf_p_11", "com_p_11",
"eng_p_11", "aca_p_11", "saf_t_11",
"com_t_11", "eng_t_11", "aca_t_11",
"saf_s_11", "com_s_11", "eng_s_11",
"aca_s_11", "saf_tot_11", "com_tot_11",
"eng_tot_11", "aca_tot_11",]
survey = survey.loc[:,survey_fields]
survey.shape
# Add survey dataset to our dictionary of datasets
data["survey"] = survey
data["survey"].head()
Condensing datasets
In order to combine all of the datasets, there can only be one unique DBN value for each row. Taking a look at the class_size dataset, we see that this is not the case.
data['class_size']['GRADE '].unique()
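A quick way to confirm that DBN repeats across rows is pandas' duplicated(). A minimal sketch with toy rows standing in for class_size (the values are hypothetical):

```python
import pandas as pd

# Toy stand-in for class_size: one school (DBN) appears on several rows,
# once per grade band and program type
cs = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M448"],
    "GRADE ": ["09-12", "09-12", "0K-09", "09-12"],
    "PROGRAM TYPE": ["GEN ED", "CTT", "GEN ED", "GEN ED"],
})
# Number of rows whose DBN already appeared on an earlier row
print(cs["DBN"].duplicated().sum())  # 2
```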
To condense the class_size dataset, we can filter the GRADE column (we only want grades 9-12) and the PROGRAM TYPE column (we only want general education). Then we can group by DBN and compute the average class size values for each school.
class_size = data['class_size']
class_size = class_size[class_size['GRADE '] == '09-12']
class_size = class_size[class_size['PROGRAM TYPE'] == 'GEN ED']
class_size = class_size.groupby('DBN').agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
Next, we will condense the demographics dataset. There is data for multiple years for the same schools, so let's filter by the most recent year available.
data['demographics'] = data['demographics'][data['demographics']['schoolyear'] == 20112012]
data['demographics'].head()
We will need to do the same for math_results. This data is segmented by Grade and Year. Let's choose the highest grade and the latest year.
data["math_results"] = data["math_results"][data["math_results"]["Year"] == 2011]
data["math_results"] = data["math_results"][data["math_results"]["Grade"] == '8']
data['math_results'].head()
grad_outcomes is the last dataset that needs to be condensed:
data['grad_outcomes'] = data['grad_outcomes'][data['grad_outcomes']['Demographic'] == 'Total Cohort']
data['grad_outcomes'] = data['grad_outcomes'][data['grad_outcomes']['Cohort'] == '2006']
data['grad_outcomes'].head()
Computing variables
Let's compute the total SAT score by adding the values from the individual columns for Critical Reading, Math, and Writing. First, we need to convert the values in these columns from strings to numbers before we can add them together.
cols = ['SAT Critical Reading Avg. Score',
'SAT Math Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data['sat_results'][c] = pd.to_numeric(data['sat_results'][c], errors='coerce')
data['sat_results']['total_sat_score'] = data['sat_results'][cols].sum(axis=1)
# Drop rows for schools with no SAT results
print(data['sat_results'].shape)
data['sat_results'].dropna(inplace=True)
print(data['sat_results'].shape)
To plot the locations of each school on a map, we need to parse out the coordinate locations for each school.
data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(lambda x: float(x.split("\n")[2].strip('()').split(',')[0]))
data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(lambda x: float(x.split("\n")[2].strip('()').split(',')[1]))
data['hs_directory'][['lat', 'lon']].head()
for k, v in data.items():
    print(k)
    print(v.head())
Combining the datasets
# Combine the datasets using the DBN column
sep_data_names = [k for k in data]
sep_data = [data[k] for k in sep_data_names]
combined = sep_data[0]
print(sep_data_names)
for i, f in enumerate(sep_data[1:]):
    name = sep_data_names[i+1]
    print(name, f.shape)
    # Number of non-unique DBN values in each dataset
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "left"
    if name in ["grad_outcomes", "ap_results"]:
        join_type = "outer"
    combined = combined.merge(f, on="DBN", how=join_type)
combined.shape
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    combined[col] = pd.to_numeric(combined[col])
combined[cols] = combined[cols].fillna(value=0)
# Generate school_dist column with the school district for later plotting
combined["school_dist"] = combined["DBN"].apply(lambda x: x[:2])
# Fill in missing values in the columns with the mean of the column
combined = combined.fillna(combined.mean())
combined.sample(5)
Compute correlations
Use the corr method to compute the correlation between total_sat_score and every other numeric column. A value closer to 1 means a stronger positive correlation, a value closer to -1 means a stronger negative correlation, and a value close to 0 means little correlation.
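As a refresher on reading these values, a toy example with synthetic columns built to correlate perfectly with a score:

```python
import pandas as pd

df = pd.DataFrame({
    "score": [1, 2, 3, 4, 5],
    "rises_with_score": [2, 4, 6, 8, 10],   # perfect positive correlation
    "falls_with_score": [10, 8, 6, 4, 2],   # perfect negative correlation
})
# corr() returns the pairwise correlation matrix; index it by one column
corrs = df.corr()["score"]
print(round(corrs["rises_with_score"], 2), round(corrs["falls_with_score"], 2))  # 1.0 -1.0
```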
# Compute correlations for the total SAT score
combined.corr()['total_sat_score'].sort_values(ascending = False)
Set the context
# Set up a map centered on NYC and add a marker on the map for each
# high school in the city
import folium
from folium import plugins as fp
schools_map = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
marker_cluster = fp.MarkerCluster().add_to(schools_map)
for name, row in combined.iterrows():
    folium.Marker([row["lat"], row["lon"]],
                  popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map
# Create heatmap to visualize where the biggest concentrations of schools are
schools_heatmap = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
schools_heatmap.add_child(fp.HeatMap([[row.lat, row.lon] for name, row in combined.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
# Compute SAT score by school district and convert school_dist to match geographic district data
district_data = combined.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
# Create function for district map
def show_district_map(col):
    geo_path = '/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/districts.geojson'
    district_map = folium.Map(location=[district_data['lat'].mean(), district_data['lon'].mean()], zoom_start=10)
    folium.Choropleth(geo_data=geo_path,
                      name='choropleth',
                      data=district_data,
                      columns=['school_dist', col],
                      key_on='feature.properties.school_dist',
                      fill_color='BuGn',
                      fill_opacity=0.7,
                      line_opacity=0.2).add_to(district_map)
    district_map.save("districts_{0}.html".format(col))
    return district_map
show_district_map("total_sat_score")
Exploring enrollment and SAT scores
import matplotlib.pyplot as plt
combined.plot.scatter(x='total_enrollment', y='total_sat_score')
plt.show()
combined[(combined['total_enrollment'] < 1000) & (combined['total_sat_score'] < 1000)]["School Name"]
combined.plot.scatter(x='ell_percent', y='total_sat_score')
plt.show()
show_district_map("ell_percent")
combined.corr()["total_sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
combined.corr()["total_sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
show_district_map("hispanic_per")
combined.corr()["total_sat_score"][["male_per", "female_per"]].plot.bar()
combined.plot.scatter(x='female_per', y='total_sat_score')
combined[(combined['female_per'] > 65) & (combined['total_sat_score'] > 1400)]['School Name']
combined["ap_avg"] = combined["AP Test Takers "] / combined["total_enrollment"]
combined.plot.scatter(x='ap_avg', y='total_sat_score')
combined[(combined['ap_avg'] > 0.3) & (combined['total_sat_score'] > 1700)]['School Name']